Prognostic biomarkers of intracerebral hemorrhage identified using targeted proteomics and machine learning algorithms

Early prognostication of patient outcomes in intracerebral hemorrhage (ICH) is critical for patient care. We aimed to investigate the role of protein biomarkers in prognosticating outcomes in ICH patients. We assessed 22 protein biomarkers using targeted proteomics in serum samples obtained from the ICH patient dataset (N = 150). We defined poor outcomes as a modified Rankin Scale score of 3–6. We incorporated clinical variables and protein biomarkers in regression models and random forest-based machine learning algorithms to predict poor outcomes and mortality. We report Odds Ratio (OR) or Hazard Ratio (HR) with 95% Confidence Interval (CI). We used five-fold cross-validation and bootstrapping for internal validation of prediction models. We included 149 patients with ICH in the 90-day and 144 in the 180-day outcome analyses. In multivariable logistic regression, UCH-L1 (adjusted OR 9.23; 95%CI 2.41–35.33), alpha-2-macroglobulin (aOR 5.57; 95%CI 1.26–24.59), and Serpin-A11 (aOR 9.33; 95%CI 1.09–79.94) were independent predictors of 90-day poor outcome; MMP-2 (aOR 6.32; 95%CI 1.82–21.90) was an independent predictor of 180-day poor outcome. In multivariable Cox regression models, IGFBP-3 (aHR 2.08; 95%CI 1.24–3.48) predicted 90-day and MMP-9 (aHR 1.98; 95%CI 1.19–3.32) predicted 180-day mortality. Machine learning identified additional predictors, including haptoglobin for poor outcomes and UCH-L1, APO-C1, and MMP-2 for mortality prediction. Overall, random forest models outperformed regression models for predicting 180-day poor outcomes (AUC 0.89), and 90-day (AUC 0.81) and 180-day mortality (AUC 0.81). Serum biomarkers independently predicted short-term poor outcomes and mortality after ICH. Further research utilizing a multi-omics platform and temporal profiling is needed to explore additional biomarkers and refine predictive models for ICH prognosis.


Introduction
Intracerebral hemorrhage (ICH) comprises 10-15% of all strokes [1], with one-month mortality of 40%, rendering it the deadliest stroke subtype [2]. It is a significant healthcare challenge worldwide, and its impact is particularly pronounced in developing nations like India. Limited healthcare resources, disparities in access to care, and unique demographic and epidemiological factors exacerbate the burden of ICH in these regions [3]. Understanding factors influencing ICH patient outcomes is crucial for optimizing clinical management, risk stratification, and resource allocation. While various clinical scores exist for predicting functional outcomes and mortality in ICH [4], such as the widely used ICH score [5], their accuracy for outcomes beyond hospital discharge or 30 days remains uncertain. Hence, there is a critical need for robust prediction models that integrate new predictor variables to improve ICH prognostication [6]. Serum biomarkers have emerged as promising candidates with the potential to enhance outcome prognostication in ICH patients [7]. Integrating serum biomarkers with clinical variables in prediction models may provide additional prognostic information and guide treatment decisions.
Therefore, we undertook this study to build prognostic models that use protein biomarkers to predict poor functional outcomes and mortality in ICH patients recruited within 24 hours of symptom onset, utilizing targeted proteomics, regression modeling, and machine learning approaches.

Study sample
We used clinical and proteomics data lodged within a prospective cohort study from a collaborative effort of the Department of Neurology, All India Institute of Medical Sciences (AIIMS), and the Institute of Genomics and Integrative Biology in New Delhi, India. The study database includes consecutive ICH patients aged ≥18 years, recruited between 04 October 2017 and 20 March 2020 within 24 hours of symptom onset. The details of this study protocol are reported in prior publications [8,9]. We obtained written informed consent from all the recruited patients or their legally authorized representatives prior to collecting blood samples and clinical history. The study was approved by the Local Institutional Ethics Committee of AIIMS, New Delhi (Ref. No. IECPG-395/28.09.2017).

Outcomes
We defined poor outcomes as a modified Rankin Scale (mRS) score of 3-6. Our second outcome measure was mortality. The outcomes were ascertained by a researcher blinded to clinical data using telephonic interviews at 90 and 180 days post-ICH.

Blood sample collection
Five ml of peripheral blood was collected from each ICH patient in serum vacutainer tubes. For serum collection, the blood was left standing at room temperature for 30 minutes until clotted. It was then centrifuged at 3000 g for 10 minutes, after which the serum was separated into cryovials. Five aliquots (100 μl each) of each sample were prepared and stored at -80˚C until further analysis.

Sample preparation
Ten μl of each serum sample was used for protein precipitation. To 90 μl of 1X phosphate-buffered saline (PBS), 10 μl serum was added and vortex mixed. Protein precipitation was performed using pre-chilled acetone. Briefly, to 100 μl protein extract, four volumes of pre-chilled acetone were added, and the mixture was vortex mixed and centrifuged at 15000 g for 10 minutes at 4˚C. The supernatant was discarded, and the protein pellets were air-dried at room temperature and suspended in 0.1 M Tris-HCl with 8 M urea, pH 8.5. Protein quantitation was performed using the Bradford assay.

Reduction, alkylation, and trypsin digestion
A total of 20 μg of protein from each sample was reduced with 25 mM of Dithiothreitol (DTT) for 30 minutes at 60˚C, followed by alkylation using 55 mM of Iodoacetamide (IAA) at room temperature (in the dark) for 30 minutes. These samples were then subjected to trypsin digestion at an enzyme-to-substrate ratio of 1:10 (trypsin:protein) for 16-18 hours at 37˚C. Finally, the tryptic peptides were vacuum dried in a vacuum concentrator.

Peptide selection for multiple reaction monitoring (MRM)-based targeted proteomics
We identified 22 potential biomarkers using a proteomics approach from previously validated data on 300 stroke patients [8]. Peptide selection for these 22 proteins was performed using search results from ProteinPilot (SCIEX, USA), PeptideAtlas [10], or in-silico generated peptides of proteins using the Expasy PeptideCutter tool [11]. Peptides with +2 and +3 charges were considered for MRM, and for each peptide, 5-6 fragment ions were used for identification. The list of peptide sequences, protein biomarkers, and their role in ICH pathophysiology is given in S1 Table.

Multiple reaction monitoring (MRM) data acquisition
Tryptic peptides obtained after digestion were desalted using reversed-phase Oasis HLB cartridges (Waters, Milford, MA) according to the following procedure: wet cartridge with 1 × 1,000 μl of 100% acetonitrile, equilibrate with 1 × 1,000 μl of 0.1% formic acid, load acidified digest, wash peptides with 1 × 1,000 μl of 0.1% formic acid, and elute with 1 × 1,000 μl of 70% acetonitrile in 0.1% formic acid. The peptide mixture was dried using a vacuum centrifuge, and the peptides were resuspended in 0.1% formic acid at a final concentration of 1 μg/μl. A heavy labeled peptide for Apo A1 (QGLLPVLESFK; K = Lysine-13C6,15N2) protein was spiked into the resolubilized plasma digest at a final concentration of 1 ng/μl.
The targeted MRM-MS [12] analysis of the tryptic peptides was performed on a TSQ Altis (Thermo Fisher, San Jose, CA). The instrument was equipped with an H-ESI ion source. A spray voltage of 3.5 kV was used with a heated ion transfer tube set at a temperature of 325˚C. Chromatographic separations of peptides were performed on a Vanquish UHPLC system (Thermo Fisher, San Jose, CA).
Ten μl of each sample was injected, and peptides were loaded on an ACQUITY UPLC BEH C18 column (130Å, 1.7 μm, 2.1 mm × 100 mm, Waters) from a cooled (4˚C) autosampler and separated with a linear gradient of water (buffer A) and acetonitrile (buffer B), both containing 0.1% formic acid, at a flow rate of 300 μl/minute in a 30-minute gradient run with the buffer conditions given in S2 Table.
The mass spectrometer was operated in Selected Reaction Monitoring (SRM) mode. For SRM acquisitions, the first quadrupole (Q1) and the third quadrupole (Q3) were both operated at 0.7 unit mass resolution. A dwell time of 6.175 milliseconds (ms) was chosen, and acquisitions occurred over the whole gradient of 30 minutes. Argon was used as the collision gas at a nominal pressure of 1.5 mTorr. Optimized collision energies were used for each peptide.

Bioinformatic and statistical analyses
We analyzed the MRM-based targeted proteomics data using Skyline version 21.1 [13]. Peptide areas were spike-in normalized using a heavy labeled peptide for Apo A1 protein and log2 transformed. For peptides with <10% missing values, the missing values were imputed using a random-forest-based imputation method. Non-biological experimental variations were removed through batch correction.
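As an illustration only, the spike-in normalization and log2 transformation could be sketched in Python as below. The exact formula is not stated in the text, so this sketch assumes that each endogenous peptide area is divided by the heavy Apo A1 peptide area measured in the same sample. The function name and all peak areas are hypothetical.

```python
import math

def spike_in_normalize_log2(light_areas, heavy_area):
    """Divide each peptide area by the heavy spike-in area, then take log2.

    Assumption: normalization is a simple light/heavy ratio per sample.
    Missing or non-positive areas are returned as None (to be imputed later).
    """
    if heavy_area <= 0:
        raise ValueError("heavy spike-in area must be positive")
    normalized = []
    for area in light_areas:
        if area is None or area <= 0:
            normalized.append(None)  # cannot be log-transformed
        else:
            normalized.append(math.log2(area / heavy_area))
    return normalized

# Hypothetical peak areas for one sample (not study data)
sample_areas = [4.0e6, 1.0e6, None, 2.5e5]
heavy_apoa1_area = 1.0e6
log2_ratios = spike_in_normalize_log2(sample_areas, heavy_apoa1_area)
print(log2_ratios)
```

Values left as None here would be candidates for the imputation step described above.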
Our study adhered to the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) guidelines [14]. We performed univariable logistic regression analysis, reporting odds ratios (OR) with 95% confidence intervals (CI), to assess poor outcomes during follow-up. Mortality rates were evaluated through Kaplan-Meier survival curves, and we constructed simple Cox proportional hazard models with hazard ratios (HR) and 95% CI. We conducted receiver operating characteristic (ROC) curve analyses and determined optimal cut-off points for each biomarker using the Youden Index (sensitivity + specificity - 1).
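To make the cut-off selection concrete, the following minimal Python sketch scans every observed biomarker value as a candidate threshold and keeps the one that maximizes the Youden Index. It assumes that higher biomarker values indicate a poor outcome and that ties are resolved in favor of the lowest cut-off. The function name and the toy data are hypothetical.

```python
def youden_optimal_cutoff(values, labels):
    """Return (cutoff, sensitivity, specificity, J) maximizing J = sens + spec - 1.

    A patient is classified as positive when the biomarker value is >= cutoff.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both outcome classes are required")
    best = None
    for cutoff in sorted(set(values)):
        tp = sum(1 for v, y in zip(values, labels) if v >= cutoff and y == 1)
        tn = sum(1 for v, y in zip(values, labels) if v < cutoff and y == 0)
        sens = tp / n_pos
        spec = tn / n_neg
        j = sens + spec - 1
        if best is None or j > best[3]:  # strict '>' keeps the lowest tied cutoff
            best = (cutoff, sens, spec, j)
    return best

# Toy data (not study data): 1 = poor outcome, 0 = good outcome
biomarker = [0.2, 0.5, 0.9, 1.1, 1.4, 1.8, 2.0, 2.3]
poor_outcome = [0, 0, 0, 1, 0, 1, 1, 1]
cutoff, sens, spec, j = youden_optimal_cutoff(biomarker, poor_outcome)
print(cutoff, sens, spec, j)  # 1.1 1.0 0.75 0.75
```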
Regression-based ICH outcome prediction models. We developed prediction models to determine independent predictors of poor outcomes and mortality in ICH. Variables with p-value <0.1 in the univariable analysis, along with demographic variables like age and sex, were included in backward stepwise multiple logistic regression or multiple Cox regression analyses. Multicollinearity among predictor variables was assessed using the variance inflation factor (VIF), and predictors with a VIF value exceeding 2.5 were removed from the model. We evaluated the discrimination ability using the area under the curve (AUC) or c-statistic.
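The VIF of predictor j equals 1 / (1 - R²_j), where R²_j is obtained by regressing predictor j on all the other predictors. A small NumPy sketch of this calculation with the 2.5 threshold is given below. It is not the analysis code used in the study; the function name and the simulated variables are hypothetical.

```python
import numpy as np

def variance_inflation_factors(X):
    """Compute VIF_j = 1 / (1 - R^2_j) for every column of X."""
    X = np.asarray(X, dtype=float)
    n_samples, n_features = X.shape
    vifs = np.empty(n_features)
    for j in range(n_features):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_samples), others])  # intercept + other predictors
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        ss_res = float(residuals @ residuals)
        ss_tot = float(((target - target.mean()) ** 2).sum())
        r_squared = 1.0 - ss_res / ss_tot
        vifs[j] = np.inf if r_squared >= 1.0 else 1.0 / (1.0 - r_squared)
    return vifs

# Simulated predictors (not study data); the third is almost a copy of the first
rng = np.random.default_rng(0)
age = rng.normal(60, 10, 200)
volume = rng.normal(30, 8, 200)
collinear = age + rng.normal(0, 1, 200)
vifs = variance_inflation_factors(np.column_stack([age, volume, collinear]))
keep = vifs <= 2.5  # flags predictors below the threshold used in the text
print(vifs.round(2), keep)
```

Both members of a collinear pair are flagged here. In practice one would typically drop one predictor at a time and recompute the VIFs.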
Machine learning-based ICH outcome prediction models. We employed a random forest-based machine learning algorithm to identify additional predictors. Categorical data were encoded in binary form, while continuous data were standardized based on the population mean and standard deviation following log-normalization for skewed distributions. Train-test splits were executed in a 7:3 ratio, and random forest models with 1000 estimators were trained using the scikit-learn package. This process was repeated 1000 times, with each iteration involving a new random seed to choose the top 10 variables for prediction. Shapley values were computed using the SHAP package, and absolute means were utilized to evaluate variable importance.
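The repeated split-and-rank procedure might look roughly like the following scikit-learn sketch. Several points are assumptions rather than details from the study: the data are synthetic, the numbers of repeats and trees are reduced so the example runs quickly, the split is stratified by outcome, and the forest's built-in impurity importance stands in for the mean absolute SHAP value.

```python
from collections import Counter

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def repeated_forest_ranking(X, y, n_repeats=20, n_estimators=100, top_k=10):
    """Count how often each feature ranks in the top_k across repeated 7:3 splits."""
    top_counts = Counter()
    aucs = []
    for seed in range(n_repeats):  # a new random seed per iteration
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.3, stratify=y, random_state=seed)
        model = RandomForestClassifier(n_estimators=n_estimators, random_state=seed)
        model.fit(X_train, y_train)
        aucs.append(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
        # Stand-in for mean |SHAP|: impurity-based importance from the forest
        ranked = np.argsort(model.feature_importances_)[::-1][:top_k]
        top_counts.update(int(i) for i in ranked)
    return top_counts, float(np.mean(aucs))

# Synthetic stand-in for the clinical + biomarker matrix (not study data)
X, y = make_classification(n_samples=150, n_features=25, n_informative=5, random_state=0)
top_counts, mean_auc = repeated_forest_ranking(X, y, n_repeats=5, n_estimators=50)
print(top_counts.most_common(10), round(mean_auc, 3))
```

Counting top-10 membership across seeds favors variables whose importance is stable rather than those that rank highly in a single lucky split.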
Internal validation. We internally validated our prediction models using bootstrapping and 5-fold cross-validation.
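One possible form of this internal validation is sketched below with scikit-learn. The exact bootstrapping scheme is not described in the text, so the sketch assumes a percentile bootstrap of the AUC computed on out-of-fold predictions. The data are synthetic and logistic regression is used only as an example model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic data standing in for the study cohort (not study data)
X, y = make_classification(n_samples=150, n_features=8, n_informative=4,
                           n_clusters_per_class=1, random_state=1)

# 5-fold cross-validation: each patient is scored by a model that never saw them
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
oof_prob = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                             cv=cv, method="predict_proba")[:, 1]
cv_auc = roc_auc_score(y, oof_prob)

# Bootstrap: resample patients with replacement to gauge the stability of the AUC
rng = np.random.default_rng(1)
boot_aucs = []
for _ in range(500):
    idx = rng.integers(0, len(y), len(y))
    if len(np.unique(y[idx])) < 2:
        continue  # AUC is undefined when a resample contains a single class
    boot_aucs.append(roc_auc_score(y[idx], oof_prob[idx]))
ci_low, ci_high = np.percentile(boot_aucs, [2.5, 97.5])
print(round(cv_auc, 3), round(ci_low, 3), round(ci_high, 3))
```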
We conducted statistical analyses using STATA software (Version 18) and R version 3.6.2. We conducted interaction network and enrichment analyses of significant biomarkers using Cytoscape version 3.10.0.

Poor outcome at 90-day and 180-day
Of 149 ICH patients, 110 (73.82%) had a poor outcome at 90 days, and of 144 ICH patients, 97 (67.36%) had a poor outcome at 180 days. See Table 1 for clinical variables and S3 Table for protein biomarkers significantly associated with poor outcomes in the univariable analysis (p<0.1).

Mortality at 90-day and 180-day
Mortality at 90 days was observed in 62 (41.61%) of 149 ICH patients and at 180 days in 67 (46.53%) of 144 ICH patients. See Table 3 for clinical variables and S4 Table for protein biomarkers significantly associated with mortality in the univariable analysis (p<0.1).

Internal validation of prediction models
Five-fold cross-validation and bootstrapping revealed that prediction models constructed with the random forest algorithm outperformed multivariable logistic and Cox regression in predicting poor outcomes at 180 days and mortality at both time points. However, there was only a marginal difference in the mean AUC values of regression models and random forest models for predicting 90-day poor outcomes (Fig 5 and Table 5).

Validation of previous ICH prediction models
We validated previously published ICH prediction scores, including ICH score, MICH score, ICH-FOS score, and ICH-GS score. These scores had AUCs ranging from 0.80 to 0.86 for predicting poor outcomes at 90 days and from 0.76 to 0.84 at 180 days. For mortality prediction, the AUCs ranged from 0.78 to 0.81 at 90 days and from 0.76 to 0.80 at 180 days. Adding biomarkers to the prediction models in this study improved the prediction of poor outcomes compared to previous models, but no difference was noted in models predicting mortality (Table 6).

Interaction network and enrichment analyses
We analyzed a protein network of the 10 biomarkers linked to poor outcomes and mortality post-ICH in the univariable analysis. This network featured ten biomarkers (nodes) with eight interactions (edges), including six highly connected biomarkers. MMP-9 displayed the highest degree of interaction, connecting with four other proteins. In our network, 7 out of 8 interactions had a score exceeding 0.80, with the most robust interaction score of 0.96 between MMP-2 and IGFBP-3 (Fig 6). The significant pathways encompassed negative regulation of catalytic activity, protein metabolic processes, extracellular space, and extracellular matrix disassembly (S5 Table).

Discussion
Our analysis reaffirms the relevance of established clinical variables in predicting poor outcomes and mortality in ICH, including age, GCS score, ICH volume, NIHSS score, and various laboratory investigations, underscoring their significance in clinical practice [5,19]. Furthermore, our analyses of 22 protein biomarkers and clinical features recorded within 24 hours post-ICH revealed UCH-L1, alpha-2-macroglobulin, Serpin A11, and MMP-2 as independent predictors of poor outcome. IGFBP-3 and MMP-9 independently predicted mortality following ICH. Machine learning-based random forest models identified additional predictors, including haptoglobin for poor outcomes and UCH-L1, APO-C1, and MMP-2 for mortality prediction.
Integrating protein biomarkers into clinical prediction models may enhance the precision of risk assessment in stroke prevention programs [20], allowing resources to be targeted at individuals most likely to develop ICH. By incorporating biomarkers that reflect underlying pathophysiological processes associated with ICH, the predictive models can provide more nuanced insights into an individual's risk of developing ICH and likelihood of experiencing poor outcomes after ICH. This enhanced precision allows healthcare providers to tailor interventions and treatment strategies more effectively based on the patient's specific risk profile. Such a targeted approach directs interventions toward individuals with the greatest need, maximizing the overall effectiveness of stroke prevention programs.
Our study reveals inconsistent optimal values for sensitivity and specificity in biomarker cutoffs from univariable analyses (S3 and S4 Tables). This emphasizes the need to consider multiple biomarkers and their cutoff values for improved accuracy and reliability in ICH outcome prediction. This study, conducted in a tertiary care center in India, holds particular significance due to the diverse demographic, genetic, and environmental factors unique to the Indian population. Understanding stroke in this context contributes to a more holistic understanding of the disease and its multifaceted risk factors, which is vital as India and many other developing nations face an increasing burden of stroke [21]. These findings can be pivotal in shaping stroke treatment and outcome prognostication strategies unique to India and similar resource-limited settings.
UCH-L1's association with poor outcomes in ICH suggests its potential as an early neurological damage marker [22]. Alpha-2-macroglobulin's role in predicting poor outcomes in ICH, previously unexplored, may relate to its involvement in protease inhibition and inflammation regulation [23]. Serpin A11's anti-inflammatory and anti-fibrotic properties may signify a response to mitigate damage after ICH [24]. IGFBP-3's downregulation in ICH could reflect a compromised injury response [25]. The role of APO-C1 and haptoglobin in ICH prognostication, reported for the first time in our study, requires further investigation.
MMP-2 and MMP-9, implicated in tissue remodeling and inflammation [26], highlight extracellular matrix dynamics in ICH pathogenesis [27]. Indeed, our network and enrichment analyses underscored MMP-9's central role in ICH pathophysiology (Fig 6), particularly its involvement in extracellular matrix-related pathways influencing ICH outcomes (S5 Table) [27]. MMP-9's association with short-term mortality but not poor functional outcomes aligns with previous Indian population findings [28], suggesting potential for MMP-9 inhibition as a neuroprotective strategy in ICH [29].
Prediction models are limited by the risk of overfitting to the training data, potentially compromising their generalizability. Cross-validation and bootstrapping address this by assessing model performance across different data subsets and testing stability through resampling. We, therefore, internally validated our findings through five-fold cross-validation and bootstrapping, demonstrating that random forest models consistently outperformed regression models in predicting 180-day poor outcome and 90-day and 180-day mortality, as evidenced by higher AUC values (Table 5). This suggests that machine learning algorithms, with their ability to capture complex interactions and patterns, offer a valuable tool for enhancing prognostic accuracy in ICH [30,31].
We also validated the performance of previously published prediction scores for ICH outcomes (Table 6). Our study demonstrated higher AUC values for poor outcome predictions compared to the existing prediction scores. This suggests that integrating novel protein biomarkers and clinical features in our model enhances the precision of risk assessment for ICH outcomes.
The choice of outcomes in this study emphasizes the importance of patient-reported outcome measures (PROMs) and quality of life in stroke research and care [32]. While many clinical trials prioritize mortality as the primary outcome, it is essential to acknowledge that survival alone may not fully represent the patient's overall well-being or quality of life [33]. An mRS score of 4-5 indicates significant disability, and for many patients, this level of impairment can be as debilitating as death itself. An mRS score of 3 signifies moderate disability that can significantly affect a patient's independence and quality of life. By consolidating mRS scores 3-6, our study essentially encompasses all patients experiencing death or significant disability.
Compared to previous studies, our study has several strengths. Firstly, we utilized a targeted proteomics approach, specifically multiple reaction monitoring, for precise and sensitive measurement of protein biomarkers. Secondly, we recruited patients within the 24-hour window, a critical period for stroke management, allowing earlier biomarker assessment than prior studies [28,34]. Thirdly, our study had minimal loss to follow-up (<5%). However, our study also has several limitations. Firstly, the small sample size warrants external validation of these biomarkers in adequately powered studies, although we internally validated our models and obtained consistent AUCs across the outcome measures. Secondly, our study only included patients from a single center, limiting its generalizability. Thirdly, focusing on a 24-hour time window may not capture the full ICH progression; earlier biomarker measurements [35,36] and exploration of temporal changes are needed. Lastly, we provided relative quantification data for protein biomarkers, suggesting the need for obtaining absolute quantification values in future studies.

Conclusion
These data reflect outcomes in a developing nation and underscore the potential of serum biomarkers, in conjunction with clinical variables, to enhance outcome prediction in ICH patients. Biomarkers like UCH-L1, alpha-2-macroglobulin, Serpin A11, MMP-2, IGFBP-3, and MMP-9 showed strong associations with outcomes, improving model accuracy. Given the better performance of random forest-based machine learning models, proteomic data hold promise for ICH prognostication. Future research should examine temporal profiles of these biomarkers in larger cohorts and explore additional pathways using multi-omics platforms to refine predictive models for ICH prognosis.